Overview: Script for cleaning water salinity data from water stations. Sources and notes on the data are below. First I remove data points flagged as failed or suspect. Then I use a 2 hr rolling mean to remove data points more than +/- 2 standard deviations from that mean. Finally, I calculate the hourly median, since data were collected at different frequencies. This sets me up to do analysis between sites and with field data. Sites are graphed from closest to the delta to the mouth of the bay.
Notes on data:
-Accessed and downloaded all data on September 30, 2020.
-Downloaded all data available from 1/1/2017-12/31/2019.
-China Camp: http://cdmo.baruch.sc.edu/aqs/zips.cfm. Data logged every 15 minutes. Requested citation format: NOAA National Estuarine Research Reserve System (NERRS). System-wide Monitoring Program. Data accessed from the NOAA NERRS Centralized Data Management Office website: http://cdmo.baruch.sc.edu/; accessed 30 September 2020.
-Richardson Bay: NERR data. Data logged every 15 minutes.
-EOS: https://oceanview.pfeg.noaa.gov/erddap/tabledap/rtcctdRTCysi.html. Data logged every 6 minutes. Time in UTC. Should I use this link instead? http://erddap.cencoos.org/erddap/tabledap/tiburon-water-tibc1.html. pH and salinity available; no air temperature.
-Fort Point: http://erddap.cencoos.org/erddap/tabledap/fort-point.html?time%2Csea_water_temperature%2Csea_water_temperature_qc_agg%2Cz&time%3E%3D2020-09-20T21%3A07%3A41Z&time%3C%3D2020-09-29T12%3A00%3A00Z. Time in UTC. Data logged every 6 minutes.
Next steps:
-fix the aesthetics of the graphs (font too small, change the gradient, clean up the keys, etc.)
-analysis with field data
-site characterization summary table
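For the aesthetics item, one option is a shared theme and gradient applied to every site plot so the fonts and color scales match. This is only a sketch: `sal_theme`, `sal_gradient`, and all the sizes are placeholder names/values, not part of the script yet.

```r
library(ggplot2)

# Hypothetical shared theme: larger base font, bold titles, legend on
# the bottom. Tune base_size and the gradient to taste.
sal_theme <- theme_minimal(base_size = 14) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")

# A single continuous color scale reused across all site plots.
sal_gradient <- scale_color_viridis_c(name = "Salinity (PSU)")

# Usage (assuming a plot object like cc_sal from below):
# cc_sal + sal_theme + sal_gradient
```

Adding the same two objects to each plot keeps the four panels visually consistent when they are arranged together at the end.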
Set up
rm(list=ls())
library(tidyverse)
library(ggpubr)
library(scales)
library(chron)
library(plotly)
library(taRifx)
library(aweek)
library(easypackages)
library(renv)
library(here)
library(ggthemes)
library(gridExtra)
library(patchwork)
library(tidyquant)
library(recipes)
library(cranlogs)
library(knitr)
library(openair)
Make a custom function to calculate the mean, standard deviation (sd), and a ~95% range, where x is a numeric vector. Here "hi" and "lo" (named hi.95/lo.95) are defined as 2 standard deviations away from the mean. This step only creates the function and does not manipulate the data. You can change the equations to define hi and lo as 3 (or however many) sds away if you would like. na.rm=TRUE removes any "NA" values. "ret" holds the new columns that will be added when you run this on a df.
custom_stat_fun_2<-function(x, na.rm=TRUE){
m<-mean(x, na.rm=na.rm)
s<-sd(x, na.rm=na.rm)
hi<-m+2*s
lo<-m-2*s
ret<-c(mean=m, stdev=s, hi.95=hi, lo.95=lo)
ret
}
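As noted above, the 2 sd cutoff could be widened or narrowed. A sketch of a generalized version with a hypothetical `n_sd` argument (not used in the rest of this script; `n_sd = 2` reproduces custom_stat_fun_2):

```r
# Hypothetical generalization: n_sd sets how many standard deviations
# define the hi/lo cutoffs. Column names are kept identical to
# custom_stat_fun_2 so the downstream filter step would not change.
custom_stat_fun_n <- function(x, n_sd = 2, na.rm = TRUE) {
  m <- mean(x, na.rm = na.rm)
  s <- sd(x, na.rm = na.rm)
  c(mean = m, stdev = s, hi.95 = m + n_sd * s, lo.95 = m - n_sd * s)
}
```

For example, `custom_stat_fun_n(Sal, n_sd = 3)` would keep more borderline points than the 2 sd version used below.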
China Camp
Read in data. Subset (not needed, but it's a large df). A column named "date" with no time stamp is needed for a later step.
cc<-read.csv("C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/china_camp_all.csv", header = TRUE, sep=",", fileEncoding="UTF-8-BOM", stringsAsFactors = FALSE)
cc$date<-as.POSIXct(cc$DateTimeStamp, format = c("%m/%d/%Y"))
cc$datetime<-as.POSIXct(cc$DateTimeStamp, format = c("%m/%d/%Y %H:%M"))
cc<-subset(cc, select=c(date, datetime, Sal, F_Sal))
cc%>%
ggplot(aes(x=datetime, y=Sal, color=Sal))+
geom_point(alpha=0.5)+
labs(title="Salinity data from China Camp: All data points",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
## Warning: Removed 847 rows containing missing values (geom_point).
Remove data that failed QC test
cc<-cc[- grep("-3", cc$F_Sal),]
cc%>%
ggplot(aes(x=date, y=Sal, color=Sal))+
geom_point(alpha=0.5)+
labs(title="Salinity data from China Camp: Failed QC points removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
## Warning: Removed 835 rows containing missing values (geom_point).
Still some points I’m not sure about so I’m going to remove data points flagged as “suspect” too
cc<-cc[- grep("1", cc$F_Sal),]
cc%>%
ggplot(aes(x=date, y=Sal, color=Sal))+
geom_point(alpha=0.5)+
labs(title="Salinity data from China Camp: Failed & suspect QC points removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
## Warning: Removed 835 rows containing missing values (geom_point).
I like this more, but did it remove too much data?
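One way to put a number on that question is to compare row counts before and after the QC filters. This is a sketch: it assumes a copy of the pre-filter data frame (here called `cc_raw`, a hypothetical name) was saved before the grep steps.

```r
# Hypothetical check: how many rows (and what percent) the QC
# filtering removed. 'before' and 'after' are data frames.
qc_loss <- function(before, after) {
  dropped <- nrow(before) - nrow(after)
  c(dropped = dropped, pct = round(100 * dropped / nrow(before), 1))
}

# Usage (assuming cc_raw was saved before filtering):
# qc_loss(cc_raw, cc)
```

A few percent lost to failed/suspect flags is probably fine; a double-digit percentage would be worth a closer look at which time periods were affected.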
Roll apply using the custom stat function. Needs a column named "date" with no time stamp in order to work. select = the parameter/column you want to apply the rolling statistics to, in this case salinity. FUN = the custom function created at the beginning. The width of the window depends on the frequency of data collection. Data are collected every 15 min, so width = 672 is 7 days, 1344 is 14 days, 2880 is 30 days; 96 observations per day, so 8 = 2 hours.
2 hour
cc<-cc%>%
tq_mutate(
select= Sal,
mutate_fun= rollapply,
width= 8,
align= "right",
by.column= FALSE,
FUN= custom_stat_fun_2,
na.rm= TRUE)
## Warning in zoo(y, order.by = index(x), ...): some methods for "zoo" objects do
## not work if the index entries in 'order.by' are not unique
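The zoo warning above usually means there are duplicated timestamps in the index column. A base-R sketch to surface any repeats before the rolling step (the function name is hypothetical):

```r
# Hypothetical diagnostic: return the timestamp values that appear
# more than once, which is what triggers zoo's 'order.by' warning.
find_dup_times <- function(ts) unique(ts[duplicated(ts)])

# Usage: inspect repeats in the datetime column before tq_mutate
# find_dup_times(cc$datetime)
```

Duplicates here are often daylight saving artifacts or repeated logger records; deciding whether to drop or keep them is worth doing deliberately rather than letting rollapply handle them silently.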
Filter to remove values that are more than +/- 2 standard deviations from the rolling mean
cc<-filter(cc, Sal <hi.95 & Sal >lo.95)
cc%>%
ggplot(aes(x=datetime, y=Sal, color=Sal))+
geom_point(alpha=0.5)+
labs(title="Salinity data from China Camp: Failed & suspect QC points + +/- 2sd removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
## Warning: Removed 7 rows containing missing values (geom_point).
Hourly median
For the timeAverage function to work, you need a date or POSIXct column named "date", but it needs the time stamp to work, so we'll add it back in. interval = frequency of data collection.
cc<-subset(cc, select=-c(date))
cc$date<-as.POSIXct(cc$datetime, format = c("%m/%d/%Y %H:%M"))
cc.1hr.med<- timeAverage(cc,
avg.time="1 hour",
data.thresh = 0,
statistic = "median",
interval = "15 min",
remove.calm = FALSE,
type = "default")
## Warning: Missing dates detected, removing 7 lines
## Warning in checkPrep(mydata, vars, type = "default", remove.calm = FALSE, :
## Detected data with Daylight Saving Time, converting to UTC/GMT
cc.1hr.med<-subset(cc.1hr.med, select=(-c(mean, stdev, hi.95, lo.95)))
cc_sal<-cc.1hr.med %>% ggplot(aes(x=datetime, y=Sal, color=Sal))+
geom_point(alpha=0.5)+ylim(0,35)+
geom_hline(yintercept=10, linetype="dashed", color = "red")+
geom_hline(yintercept=20, linetype="dashed", color = "red")+
labs(title="Hourly median salinity from China Camp",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
cc_sal
## Warning: Removed 4563 rows containing missing values (geom_point).
Pretty good overall. I’m just not sure about the segments of data that were removed when I filtered for “suspect” data.
Format and save
cc.1hr.med$date<-as.Date(cc.1hr.med$datetime, format = c("%Y-%m-%d"))
names(cc.1hr.med)[names(cc.1hr.med) == "Sal"] <- "salinity"
write.csv(cc.1hr.med, "C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/cc_sal.csv")
EOS pier
Set up
eos<-read.csv("C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/eos_bouy_r.csv", header = TRUE, sep=",", fileEncoding="UTF-8-BOM", stringsAsFactors = FALSE)
eos$date<-as.Date(eos$date, format = c("%m/%d/%Y"))
eos$datetime<-as.POSIXct(eos$datetime, format = c("%Y-%m-%dT%H:%M:%SZ"))
eos<-subset(eos, select=c(date, datetime, salinity))
eos%>%
ggplot(aes(x=datetime, y=salinity, color=salinity))+
geom_point(alpha=0.5)+
labs(title="Salinity data from EOS research pier: All data",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
Plot with better view
eos%>%
ggplot(aes(x=datetime, y=salinity, color=salinity))+
geom_point(alpha=0.5)+ylim(0,35)+
labs(title="Salinity data from EOS research pier: All data",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
## Warning: Removed 2 rows containing missing values (geom_point).
Since this data set doesn’t have a QC test column, I’ll remove values that are far outside expected salinity values in Tiburon.
eos<-filter(eos, salinity >0 & salinity <40)
eos%>%
ggplot(aes(x=datetime, y=salinity, color=salinity))+
geom_point(alpha=0.5)+
labs(title="Salinity data from EOS research pier: Values outside of expected range removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
Not sure about the pocket of lower salinity values at the end of 2018; need to verify. A few other points stand out, but they should be removed in the next steps. This corresponds with suspicious water temperature data as well.
Roll apply using the custom stat function. Data are collected every 6 minutes, so width = 1680 is 7 days, 3360 is 14 days, 7200 is 30 days; 240 observations per day, so 20 = 2 hours. Needs a column named "date" with no time stamp in order to work.
2hr
eos<-eos%>%
tq_mutate(
select= salinity,
mutate_fun= rollapply,
width= 20,
align= "right",
by.column= FALSE,
FUN= custom_stat_fun_2,
na.rm= TRUE)
## Warning in zoo(y, order.by = index(x), ...): some methods for "zoo" objects do
## not work if the index entries in 'order.by' are not unique
Filter and graph
eos<-filter(eos, salinity <hi.95 & salinity >lo.95)
ggplot(data = eos, mapping = aes(x = datetime, y = salinity, color=salinity)) + geom_point()+
labs(title="Salinity data from EOS research pier: Extreme + +/- 2sd data removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
Looks better. Need to verify that low salinity pocket at the end of 2018. Similar pattern in the pH data at this site and time.
Hourly median
eos<-subset(eos, select=-c(date))
names(eos)[names(eos) == "datetime"] <- "date"
eos$date<-as.POSIXct(eos$date, format = c("%Y-%m-%dT%H:%M:%SZ"))
eos.1hr.med<- timeAverage(eos,
avg.time="1 hour",
data.thresh = 0,
statistic = "median",
interval = "6 min",
remove.calm = FALSE,
type = "default")
## Warning in checkPrep(mydata, vars, type = "default", remove.calm = FALSE, :
## Detected data with Daylight Saving Time, converting to UTC/GMT
eos.1hr.med<-subset(eos.1hr.med, select=(-c(mean, stdev, hi.95, lo.95)))
eos_sal<-eos.1hr.med %>% ggplot(aes(x=date, y=salinity, color=salinity))+
geom_point(alpha=0.5)+ylim(0,35)+
geom_hline(yintercept=10, linetype="dashed", color = "red")+
geom_hline(yintercept=20, linetype="dashed", color = "red")+
labs(title="Hourly median salinity from EOS research pier",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
eos_sal
## Warning: Removed 314 rows containing missing values (geom_point).
Looks good. Need to verify lower pocket at end of 2018.
Format and save
names(eos.1hr.med)[names(eos.1hr.med) == "date"] <- "datetime"
eos.1hr.med$date<-as.Date(eos.1hr.med$datetime, format = c("%Y-%m-%d"))
write.csv(eos.1hr.med, "C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/eos_sal.csv")
Richardson Bay
Set up
rb<-read.csv("C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/richardson_bay_all.csv", header = TRUE, sep=",", fileEncoding="UTF-8-BOM", stringsAsFactors = FALSE)
rb$date<-as.Date(rb$DateTimeStamp, format = c("%m/%d/%y"))
rb$datetime<-as.POSIXct(rb$DateTimeStamp, format = c("%m/%d/%y %H:%M"))
rb<-subset(rb, select=c(date, datetime, Sal, F_Sal, DateTimeStamp))
rb%>%
ggplot(aes(x=datetime, y=Sal, color=Sal))+
geom_point(alpha=0.5)+
labs(title="Salinity data from Richardson Bay: All data",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
## Warning: Removed 1508 rows containing missing values (geom_point).
Remove flagged data that did not pass QAQC. F_Sal of -3 indicates fail.
rb<-rb[- grep("-3", rb$F_Sal),]
rb%>%
ggplot(aes(x=datetime, y=Sal, color=Sal))+
geom_point(alpha=0.5)+
labs(title="Salinity data from Richardson Bay: Failed QC data removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
## Warning: Removed 1508 rows containing missing values (geom_point).
Not sure about that low salinity pocket at the end of 2017 so I’m going to remove data points flagged as “suspect” and see if that helps.
rb<-rb[- grep("1", rb$F_Sal),]
rb%>%
ggplot(aes(x=datetime, y=Sal, color=Sal))+
geom_point(alpha=0.5)+
labs(title="Salinity data from Richardson Bay: Failed & suspect QC data removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
## Warning: Removed 1508 rows containing missing values (geom_point).
Looks better and got rid of most of the event I wasn't sure about. Still some points that I don't like; hopefully the next step helps. Like China Camp, I'm not sure if this removed too much data.
Roll apply using the custom stat function. Data are collected every 15 min, so width = 672 is 7 days, 1344 is 14 days, 2880 is 30 days; 96 observations per day, so 8 = 2 hours.
rb<-rb%>%
tq_mutate(
select= Sal,
mutate_fun= rollapply,
width= 8,
align= "right",
by.column= FALSE,
FUN= custom_stat_fun_2,
na.rm= TRUE)
## Warning in zoo(y, order.by = index(x), ...): some methods for "zoo" objects do
## not work if the index entries in 'order.by' are not unique
Filter
rb<-filter(rb, Sal <hi.95 & Sal >lo.95)
rb%>%
ggplot(aes(x=datetime, y=Sal, color=Sal))+
geom_point(alpha=0.5)+
labs(title="Salinity data from Richardson Bay: Failed & suspect QC data + +/- 2sd removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
## Warning: Removed 11 rows containing missing values (geom_point).
Hourly median
rb<-subset(rb, select=-c(date))
rb$date<-as.POSIXct(rb$datetime, format = c("%m/%d/%y %H:%M"))
rb.1hr.med<- timeAverage(rb,
avg.time="1 hour",
data.thresh = 0,
statistic = "median",
interval = "15 min",
remove.calm = FALSE,
type = "default")
## Warning: Missing dates detected, removing 11 lines
## Warning in checkPrep(mydata, vars, type = "default", remove.calm = FALSE, :
## Detected data with Daylight Saving Time, converting to UTC/GMT
rb_sal<-rb.1hr.med%>%
ggplot(aes(x=date, y=Sal, color=Sal))+
geom_point(alpha=0.5)+ylim(0,35)+
geom_hline(yintercept=10, linetype="dashed", color = "red")+
geom_hline(yintercept=20, linetype="dashed", color = "red")+
labs(title="Hourly median salinity data from Richardson Bay",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of NERR")
rb_sal
## Warning: Removed 11 rows containing missing values (geom_point).
Format and save
rb.1hr.med<-subset(rb.1hr.med, select=(-c(mean, stdev, hi.95, lo.95)))
names(rb.1hr.med)[names(rb.1hr.med) == "Sal"] <- "salinity"
rb.1hr.med$date<-as.Date(rb.1hr.med$date, format = c("%Y-%m-%d"))
write.csv(rb.1hr.med, "C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/rb_sal.csv")
Fort Point
Set up. For some reason I was getting an error later when I converted both the date and datetime columns to date columns, so unlike the other sites I'm only converting one here and will convert the other later. The method is still consistent with the other sites.
fp<-read.csv("C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/fort_point_r.csv", header = TRUE, sep=",", fileEncoding="UTF-8-BOM", stringsAsFactors = FALSE)
fp$date<-as.Date(fp$date, format = c("%Y-%m-%d"))
names(fp)[names(fp) == "sea_water_practical_salinity"] <- "salinity"
names(fp)[names(fp) == "sea_water_practical_salinity_qc_agg"] <- "F_Sal"
fp<-subset(fp, select=c(datetime, date, salinity, F_Sal))
fp%>%
ggplot(aes(x=date, y=salinity, color=salinity))+
geom_point(alpha=0.5)+
labs(title="Salinity data from Fort Point: All Data",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
Remove flagged data. F_Sal of 4 indicates fail.
fp<-filter(fp, F_Sal!=4)
fp%>%
ggplot(aes(x=date, y=salinity, color=salinity))+
geom_point(alpha=0.5)+
labs(title="Salinity data from Fort Point: Failed QC data removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
Still a lot of points I don't like; the 2 hour window did not remove enough outliers. I'm going to also remove points flagged as "suspect" and see if that helps.
fp<-filter(fp, F_Sal!=3)
fp%>%
ggplot(aes(x=date, y=salinity, color=salinity))+
geom_point(alpha=0.5)+
labs(title="Salinity data from Fort Point: Failed & suspect QC data removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
This already looks much better. It still has those points around 0 which just do not look right to me but hopefully they are removed with the next few steps.
Roll apply using the custom stat function. Data are collected every 6 minutes, so 2 hours = 20 observations.
This is where I was getting an error when I had both the date and datetime columns converted, so I just converted the date column, since that's what's used at this step, and will convert the other one when I use it. It didn't do this before, so I'm not sure what's changed.
fp<-fp%>%
tq_mutate(
select= salinity,
mutate_fun= rollapply,
width= 20,
align= "right",
by.column= FALSE,
FUN= custom_stat_fun_2,
na.rm= TRUE)
## Warning in zoo(y, order.by = index(x), ...): some methods for "zoo" objects do
## not work if the index entries in 'order.by' are not unique
fp<-filter(fp, salinity <hi.95 & salinity >lo.95)
fp%>%
ggplot(aes(x=date, y=salinity, color=salinity))+
geom_point(alpha=0.5)+
labs(title="Salinity data from Fort Point: Failed & suspect QC + +/-2sd removed",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
Still a few points around 0 lingering.
Hourly median
fp<-subset(fp, select=-c(date))
fp$date<-as.POSIXct(fp$datetime, format = c("%Y-%m-%dT%H:%M:%SZ"))
fp.1hr.med<- timeAverage(fp,
avg.time="1 hour",
data.thresh = 0,
statistic = "median",
interval = "6 min",
remove.calm = FALSE,
type = "default")
## Warning: Missing dates detected, removing 27 lines
## Warning in checkPrep(mydata, vars, type = "default", remove.calm = FALSE, :
## Detected data with Daylight Saving Time, converting to UTC/GMT
fp_sal<-ggplot(data = fp.1hr.med, mapping = aes(x = date, y = salinity, color=salinity)) + geom_point()+ylim(0,35)+
geom_hline(yintercept=10, linetype="dashed", color = "red")+
geom_hline(yintercept=20, linetype="dashed", color = "red")+
labs(title="Hourly median salinity data from Fort Point",
subtitle="01/01/2017 - 12/31/2019",
caption= "data courtesy of CeNCOOS")
fp_sal
## Warning: Removed 3117 rows containing missing values (geom_point).
There are still a few points around 0 that don't look right. Should I remove them earlier so they don't affect the hourly median?
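One option for the question above: drop the near-zero readings before the rolling window, mirroring the expected-range filter used for the EOS pier data, so they never enter the hourly median. A sketch; the 1 PSU cutoff and the `drop_low_sal` name are assumptions to check against Fort Point's plausible range.

```r
# Hypothetical pre-filter: remove implausibly fresh readings before
# the rolling stats. min_sal = 1 is an assumed cutoff, not a QC flag.
drop_low_sal <- function(df, min_sal = 1) subset(df, salinity >= min_sal)

# Usage: apply right after the QC-flag filters, before tq_mutate
# fp <- drop_low_sal(fp)
```

Filtering before the rolling window matters because a cluster of near-zero points inflates the local standard deviation, which widens the +/- 2 sd band and lets some of the same points survive.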
Format and save
fp.1hr.med<-subset(fp.1hr.med, select=(-c(mean, stdev, hi.95, lo.95)))
names(fp.1hr.med)[names(fp.1hr.med) == "date"] <- "datetime"
fp.1hr.med$date<-as.Date(fp.1hr.med$datetime, format = c("%Y-%m-%d"))
write.csv(fp.1hr.med, "C:/Users/chels/Box Sync/Thesis/Data/Working data/Bouy data/fp_sal.csv")
Salinity graphs together
all_sal<-ggarrange(cc_sal, eos_sal, rb_sal, fp_sal, nrow=4, ncol=1)
## Warning: Removed 4563 rows containing missing values (geom_point).
## Warning: Removed 314 rows containing missing values (geom_point).
## Warning: Removed 11 rows containing missing values (geom_point).
## Warning: Removed 3117 rows containing missing values (geom_point).
all_sal
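Since patchwork is already loaded, an alternative to ggarrange that also addresses the "clean up keys" item from the next steps: stack the four site plots and collect their color legends into one shared key. A sketch that assumes the four plot objects created above exist in the session.

```r
library(patchwork)

# Hypothetical alternative layout: '/' stacks plots vertically, and
# plot_layout(guides = "collect") merges duplicate legends into one.
all_sal_pw <- cc_sal / eos_sal / rb_sal / fp_sal +
  plot_layout(guides = "collect")
all_sal_pw
```

Because every panel maps color to salinity, collecting the guides replaces four identical legends with a single one and frees up plot area.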